As a reminder, the dataset to be analyzed contains 21894 features
(genes) across 34950 cells, distributed over 6 samples from two
conditions. Ribosomal genes were excluded from the dataset, and
low-quality cells were discarded during a previous QC step.
Normalization and dimensionality reduction (PCA, UMAP)
Normalization aims to remove technical factors, such as library
size, that may confound real biological heterogeneity. In this case,
the effect of mitochondrial content is also removed during
normalization, as a possible source of unwanted variation in our samples.
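As an illustration of what library-size normalization does, here is a minimal sketch in plain Python. The report does not state which normalization method was used, so this is an assumption: it shows a common log-normalization (counts divided by the per-cell total, multiplied by a scale factor, then log-transformed) and does not include the removal of mitochondrial content. The function name `log_normalize` and the toy matrix are hypothetical.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Log-normalize a genes x cells count matrix given as a list of rows.

    Assumption: log1p(count / library_size * scale_factor), one common
    scheme; not necessarily the method used to produce this report.
    """
    n_cells = len(counts[0])
    # Library size = total counts per cell (column sums).
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        [
            math.log1p(row[j] / lib_sizes[j] * scale_factor) if lib_sizes[j] else 0.0
            for j in range(n_cells)
        ]
        for row in counts
    ]

# Toy example (made-up values): 3 genes x 2 cells, where the second cell
# has the same composition but was sequenced twice as deeply.
toy_counts = [[1, 2], [3, 6], [6, 12]]
normalized = log_normalize(toy_counts)
```

After normalization the two toy cells have identical profiles, which is the intended effect: a difference caused only by sequencing depth disappears.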
Once the data are normalized, the next step is dimensionality
reduction, which is required for visualization and eases
clustering. Dimensionality reduction is performed with two methods:
first PCA, and then UMAP, computed on a selected number of PCs from
PCA.
To choose an appropriate number of PCs for downstream analysis,
the genes defining each PC can be inspected. They are listed
below for the first 10 PCs (limited to 10 genes per direction):
## PC_ 1
## Positive: Myl7, Acta2, Tnnt2, Ttn, Tpm1, Cdkn1c, Myl4, Tmsb4x, Myl3, Vim
## Negative: Dppa5a, Hbb-bh1, Mt1, Hba-x, Hba-a1, Mkrn1, Chchd10, L1td1, Mt2, Trh
## PC_ 2
## Positive: Hbb-bh1, Hba-x, Hba-a1, Hba-a2, Hbb-bs, Hbb-y, Slc25a21, Reln, Fth1, Gypa
## Negative: Dppa5a, Mt1, Mkrn1, Trh, L1td1, Pou5f1, Chchd10, Tdh, Mt2, Zfp42
## PC_ 3
## Positive: Myl7, Tnnt2, Ttn, Myl4, Myl3, Slc8a1, Acta2, Actc1, Dppa5a, Cdkn1c
## Negative: Ctla2a, Vim, Krt8, Krt18, Peg3, Tmsb4x, Ramp2, S100a10, Spink1, Fn1
## PC_ 4
## Positive: Spink1, Krt18, Krt8, Emb, Tagln, Slc2a3, Car4, Tpm1, Hba-x, Ttr
## Negative: Ctla2a, Ramp2, Hapln1, Vim, Fli1, Cdh5, Ctla2b, Ptprm, Rhoj, Egfl7
## PC_ 5
## Positive: Phlda2, Hand1, Pmp22, Unc5c, Prdm6, Pcsk5, Mesp1, Cdh11, Gpc6, Gpc3
## Negative: Ctla2a, Spink1, Tmsb4x, Emb, Ramp2, Krt18, Flt1, Slc2a3, Apoe, Malat1
## PC_ 6
## Positive: Acta2, Tagln, Tmsb4x, Tpm1, Krt8, Ctla2a, Krt18, Vim, Hand1, Ahnak
## Negative: Malat1, Prkg1, Lsamp, Gm42418, Gpc6, Auts2, Gpc3, Meis2, Maml3, Lhx1
## PC_ 7
## Positive: Tmsb4x, Acta2, Tagln, Dppa5a, Igfbp5, Igf2, Airn, Csrp2, Ahnak, Tpm1
## Negative: Spink1, Myl7, Emb, Fgf8, Tnnt2, Ttr, S100a10, Car2, Myl4, Mixl1
## PC_ 8
## Positive: Tagln, Fgf8, T, Tmsb4x, Fst, S100a6, Sprr2a3, Lefty1, Psmb8, Sfn
## Negative: Spink1, Ttr, Unc5c, Dppa5a, Emb, Car4, Fth1, Hand1, Airn, Rbp4
## PC_ 9
## Positive: Dppa5a, Hba-x, Hba-a1, S100a10, Gpc3, Ifitm1, Pcdh7, Phlda2, Pcsk5, Peg3
## Negative: Igfbp5, Reln, Ccnd2, Prkar2b, Fgf3, Pax9, Vgll2, Fth1, Lmo2, Prkca
## PC_ 10
## Positive: Dppa5a, Igfbp5, Tmsb4x, Lsamp, Lhx1, Emb, Tmsb10, Ctla2a, Gpc3, Meis2
## Negative: Malat1, Unc5c, 2410006H16Rik, Gm42418, Ftl1-ps1, Magi2, Exoc4, Anks1b, Ddit3, Ankrd12
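A listing like the one above can be derived from the gene loadings of a PC by taking the genes at each extreme. The following is a minimal sketch of that idea, not the code that produced the output; the function name `top_genes`, the placeholder gene names, and the loading values are all hypothetical.

```python
def top_genes(loadings, n=10):
    """Return (positive, negative) gene lists for one PC.

    `loadings` maps gene name -> loading on that PC. Genes are ordered by
    decreasing absolute loading within each direction.
    """
    ranked = sorted(loadings, key=loadings.get)  # ascending by loading
    negative = [g for g in ranked if loadings[g] < 0][:n]
    positive = [g for g in reversed(ranked) if loadings[g] > 0][:n]
    return positive, negative

# Made-up loadings for illustration only.
example_loadings = {
    "geneA": 0.12, "geneB": 0.09, "geneC": 0.01,
    "geneD": -0.11, "geneE": -0.04,
}
positive, negative = top_genes(example_loadings, n=2)
```

Genes that recur at the extremes of several PCs contribute to more than one axis of variation, which is one reason to read these lists together rather than PC by PC.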
Another useful plot is the elbow plot, which ranks PCs by
the percentage of variance each one explains.

An elbow is observed at approximately 25-35 PCs, suggesting
that most of the true signal is captured in those first PCs. On the
other hand, 38 PCs are required to reach a cumulative percentage of
explained variation greater than 85%. Based on this, a total of
38 PCs are used to run UMAP.
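The cumulative-variance criterion can be sketched as follows. This is an illustrative example in plain Python under one assumption: percentages are computed relative to the total variance of the PCs that were calculated, not of the full dataset. The function name `pcs_for_cumulative_variance` and the input values are hypothetical and are not the standard deviations from this analysis.

```python
def pcs_for_cumulative_variance(stdevs, threshold=85.0):
    """Smallest number of leading PCs whose cumulative % variance exceeds `threshold`.

    `stdevs` holds the standard deviation of each PC, in PC order.
    Percentages are relative to the summed variance of the supplied PCs.
    """
    variances = [s * s for s in stdevs]
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += 100.0 * variance / total
        if cumulative > threshold:
            return k
    # Threshold of 100 or more: all PCs are needed.
    return len(variances)

# Made-up standard deviations for four PCs (variances 9, 4, 1, 1).
example_stdevs = [3.0, 2.0, 1.0, 1.0]
n_pcs_85 = pcs_for_cumulative_variance(example_stdevs, threshold=85.0)
n_pcs_90 = pcs_for_cumulative_variance(example_stdevs, threshold=90.0)
```

A fixed threshold is reproducible, whereas reading an elbow off a plot is a visual judgment; using both, as done here, gives a cross-check.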
The resulting UMAP layout, with cells distinguished by sample condition, is:

This representation shows substantial overlap between conditions,
which can be interpreted as good alignment among samples. Thus, it is
concluded that sample integration is not
needed. When this plot is generated per sample, the same
conclusion holds:

It is recommended to inspect the UMAP against other possible
confounding variables, e.g. mitochondrial content or the number of
genes detected per cell. This is shown in the following figures:

Finally, the cell cycle phase can be inferred from a reference set of
genes defined by Tirosh et al. (2016), and the phase
labels can be shown on the UMAP, as in the following figure:

Since this dataset includes developing cells, regressing out
cell cycle information is not recommended.
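For reference, the phase labels discussed above are typically derived from two per-cell scores, one for the S-phase gene set and one for the G2/M gene set. The sketch below shows a simple, commonly used assignment rule. It is an assumption, since the report does not state which tool or rule was applied, and it does not compute the scores themselves. The function name `assign_phase` is hypothetical.

```python
def assign_phase(s_score, g2m_score):
    """Assign a cell cycle phase label from S and G2/M scores.

    Assumed rule: a cell with both scores negative is labeled G1;
    otherwise it takes the phase with the higher score, and exact
    ties are left undecided.
    """
    if s_score < 0 and g2m_score < 0:
        return "G1"
    if s_score == g2m_score:
        return "Undecided"
    return "S" if s_score > g2m_score else "G2M"

# Made-up (S score, G2/M score) pairs for three cells.
example_scores = [(-0.2, -0.1), (0.4, 0.1), (0.05, 0.3)]
phases = [assign_phase(s, g2m) for s, g2m in example_scores]
```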